Movie Rating Model and Predictor

Part 5: Modeling

At this stage, several linear regression models, built using forward selection and backward elimination, were developed and evaluated to predict the popularity of a film. Popularity was defined in terms of box office success. Statistical analysis indicated that the log of the number of IMDB votes was the best predictor of box office success; it was therefore used as the response variable. The models and their selection methods are listed in Table 1.

Table 1: Prediction Models
Model Model.Selection Data
Alpha Forward Selection Full model
Beta Forward Selection Full model with univariate outliers removed
Gamma Backward Elimination Full model
Delta Backward Elimination Full model, influential outliers removed

Model Selection

Two model selection techniques were used: forward selection and backward elimination. The forward selection approach optimized adjusted R-squared, whereas the backward elimination approach was based upon p-values.

Forward Selection

The forward selection process began with a null model. Each candidate variable was added to the current model, one at a time, and the variable that provided the greatest improvement in adjusted R-squared was retained. The process was repeated with the variables not yet in the model until no remaining variable improved adjusted R-squared.
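
The greedy loop described above can be sketched as follows. This is a minimal illustration, not the code used for this analysis: `forward_select`, `adjusted_r2` and the `score` callable are hypothetical names, and the toy scoring function (with made-up numbers) stands in for fitting a regression and returning its adjusted R-squared.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def forward_select(candidates, score):
    """Greedy forward selection.

    candidates: iterable of variable names.
    score: callable taking a tuple of selected names and returning the
           criterion to maximize (for example, adjusted R-squared).
    Returns the selected variables in the order they were added.
    """
    selected = []
    remaining = list(candidates)
    best = score(tuple(selected))  # criterion for the null model
    while remaining:
        # Score every model that adds one more variable to the current model.
        trials = [(score(tuple(selected + [v])), v) for v in remaining]
        top_score, top_var = max(trials)
        if top_score <= best:
            break  # no candidate improves the criterion
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected


# Toy criterion (invented numbers): each variable contributes a fixed gain,
# and every added variable costs a small complexity penalty.
GAIN = {"a": 0.50, "b": 0.05, "c": 0.001}


def toy_score(subset):
    return sum(GAIN[v] for v in subset) - 0.01 * len(subset)


print(forward_select(["a", "b", "c"], toy_score))
```

In the toy run, the third variable is rejected because its gain is smaller than the penalty, which mirrors how adjusted R-squared discounts predictors that add little explanatory power.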

Backward Elimination

The backward elimination approach began with the full model. A regression analysis was performed and the least significant predictor (that with the highest p-value) was removed from the model. This process was repeated, removing only the least significant predictor at each step, until all remaining predictors had p-values below the preset threshold.
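
A minimal sketch of this loop is shown below. The names `backward_eliminate` and `p_values` are hypothetical, the 0.05 threshold is an assumption (the report does not state the value used), and the toy p-values are invented and held fixed across refits, which a real regression would not do.

```python
def backward_eliminate(variables, p_values, threshold=0.05):
    """Backward elimination on p-values.

    variables: predictor names in the full model.
    p_values: callable taking a list of names and returning {name: p-value}
              for the model refit with exactly those predictors.
    threshold: p-value a predictor must fall below to stay (assumed 0.05).
    Returns (kept, removed), where removed lists (name, p-value) in order.
    """
    kept = list(variables)
    removed = []
    while kept:
        pvals = p_values(kept)
        worst = max(kept, key=lambda v: pvals[v])  # least significant predictor
        if pvals[worst] < threshold:
            break  # every remaining predictor is below the threshold
        kept.remove(worst)
        removed.append((worst, pvals[worst]))
    return kept, removed


# Toy p-values (invented); a real implementation would refit the model here.
TOY_P = {"x1": 0.001, "x2": 0.53, "x3": 0.20, "x4": 0.03}


def toy_p_values(names):
    return {v: TOY_P[v] for v in names}


kept, removed = backward_eliminate(list(TOY_P), toy_p_values)
print(kept)     # predictors that survive
print(removed)  # removal order with the p-value at removal
```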

Full Model Selection

In the prior section, association and correlation tests were conducted with a 95% confidence level. All categorical variables were significant (Table 2) and were included in the full model. The decision to remove a quantitative variable was based largely upon its correlation with the response variable (Table 3) and its correlations with other quantitative variables (Table 4). Table 5 lists the variables removed and the rationale.

Table 5: Variables excluded from full model
Type Variable Description Rationale
Categorical actor1 First main actor/actress in the abridged cast of the movie Not predictive without other data
Categorical actor2 Second main actor/actress in the abridged cast of the movie Not predictive without other data
Categorical actor3 Third main actor/actress in the abridged cast of the movie Not predictive without other data
Categorical actor4 Fourth main actor/actress in the abridged cast of the movie Not predictive without other data
Categorical actor5 Fifth main actor/actress in the abridged cast of the movie Not predictive without other data
Categorical director Director of the movie Not predictive without other data
Categorical dvd_rel_day Day of the month the movie is released on DVD No predictive value
Categorical dvd_rel_month Month the movie is released on DVD No predictive value
Categorical dvd_rel_year Year the movie is released on DVD No predictive value
Categorical imdb_url Link to IMDB page for the movie No predictive value
Categorical rt_url Link to Rotten Tomatoes page for the movie No predictive value
Categorical thtr_rel_date Date the movie is released in theaters No predictive value
Categorical thtr_rel_day Day of the month the movie is released in theaters No predictive value
Categorical thtr_rel_year Year the movie is released in theaters No predictive value
Categorical title Title of movie No predictive value
Categorical title_type Type of movie (Documentary, Feature Film, TV Movie) Highly correlated with genre
Numeric daily_box_office Daily box office revenue from BoxOfficeMojo.com Not in data set
Numeric imdb_num_votes Number of votes on IMDB Highly correlated with imdb_num_votes_log
Numeric votes_per_day The number of IMDB Votes / thtr_days Highly correlated with votes_per_day_scores_log
Numeric votes_per_day_log Log of votes_per_day Highly correlated with votes_per_day_scores_log
Thus, the full model is presented in Table 6.
Type Variable Description
Categorical audience_rating Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
Categorical best_actor_win Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
Categorical best_actress_win Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie
Categorical best_dir_win Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
Categorical best_pic_nom Whether or not the movie was nominated for a best picture Oscar (no, yes)
Categorical best_pic_win Whether or not the movie won a best picture Oscar (no, yes)
Categorical critics_rating Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
Categorical genre Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
Categorical mpaa_rating MPAA rating of the movie (G, PG, PG-13, R, Unrated)
Categorical studio The studio that produced the film
Categorical thtr_rel_month Month the movie is released in theaters
Categorical thtr_rel_season Season the movie was released in theaters
Categorical top200_box Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
Numeric audience_score Audience score on Rotten Tomatoes
Numeric cast_experience The sum across all cast members for a film, of the number of films in which each actor appeared
Numeric cast_experience_log Log of the sum across all cast members for a film, of the number of films in which each actor appeared
Numeric cast_votes Total number of allocated IMDB votes per day for the cast of a film
Numeric cast_votes_log Log of cast_votes
Numeric critics_score Critics score on Rotten Tomatoes
Numeric daily_box_office_log Log of Box office revenue from BoxOfficeMojo.com
Numeric director_experience Total number of films in sample for a director
Numeric director_experience_log Log of the total number of films directed by the film’s director
Numeric imdb_num_votes_log Log number of IMDB votes
Numeric imdb_rating Rating on IMDB
Numeric runtime Runtime of movie (in minutes)
Numeric runtime_log Log runtime of movie (in minutes)
Numeric thtr_days Number of days from theatre release date to January 1, 2016
Numeric thtr_days_log Log of thtr_days

Model Alpha

For this model, a forward selection procedure was undertaken based upon the full model described above. The variables were added as described in Table 7.

Table 7: Model Alpha forward selection process
Step Selected Model.Size DF F.statistic R.Squared Adjusted.R2 p.value Pct Chg
1 cast_votes_log 1 2 489 536.66 0.52 0.52 0 0.00
2 genre 2 12 479 56.46 0.56 0.56 0 6.32
3 critics_score 3 13 478 59.88 0.60 0.59 0 6.31
4 best_pic_win 4 14 477 57.33 0.61 0.60 0 1.53
5 cast_experience_log 5 15 476 55.37 0.62 0.61 0 1.50
6 runtime_log 6 16 475 53.12 0.63 0.62 0 1.15
7 director_experience_log 7 17 474 50.60 0.63 0.62 0 0.49
8 thtr_rel_month 8 28 463 30.56 0.64 0.62 0 0.32
9 best_pic_nom 9 29 462 29.62 0.64 0.62 0 0.16

Model Overview

This model is defined as follows: \[y_i = (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \beta_7 x_{i7} + \beta_8 x_{i8} + \beta_9 x_{i9}) + \epsilon_i\]

where:
\(y_i\) is the log number of votes for movie \(i\)
\(x_{i1}\) is log of cast_votes for movie \(i\)
\(x_{i2}\) is genre of movie (action & adventure, comedy, documentary, drama, horror, mystery & suspense, other) for movie \(i\)
\(x_{i3}\) is critics score on Rotten Tomatoes for movie \(i\)
\(x_{i4}\) is whether or not the movie won a best picture Oscar (no, yes) for movie \(i\)
\(x_{i5}\) is log of the sum across all cast members for a film, of the number of films in which each actor appeared for movie \(i\)
\(x_{i6}\) is log runtime of movie (in minutes) for movie \(i\)
\(x_{i7}\) is log of the total number of films directed by the film’s director for movie \(i\)
\(x_{i8}\) is month the movie is released in theaters for movie \(i\)
\(x_{i9}\) is whether or not the movie was nominated for a best picture Oscar (no, yes) for movie \(i\)

\(\epsilon_i\) is the residual for movie \(i\)

As suggested by Figure 1, the model was significant (F(28, 462) = 29.625, p < .001), with an adjusted R-squared of 0.621.

Figure 1 Model Alpha Regression

Analysis of Variance

Figure 2 summarizes the analysis of variance.
Term Df Sum Sq Mean Sq F Statistic Pr(>F) % Var
cast_votes_log 1 1399.493 1399.493 675.751 0.000 52.32
genre 10 110.584 11.058 5.340 0.000 4.13
critics_score 1 96.140 96.140 46.422 0.000 3.59
best_pic_win 1 24.689 24.689 11.921 0.001 0.92
cast_experience_log 1 26.277 26.277 12.688 0.000 0.98
runtime_log 1 18.511 18.511 8.938 0.003 0.69
director_experience_log 1 11.255 11.255 5.435 0.020 0.42
thtr_rel_month 11 26.295 2.390 1.154 0.317 0.98
best_pic_nom 1 4.644 4.644 2.242 0.135 0.17
Residuals 462 956.811 2.071 NA NA 35.77

Figure 2 Model Alpha analysis of variance

An analysis of variance was conducted to assess the contribution of the 9 predictors to the log number of IMDB votes. The effect of cast_votes_log was significant, F(1, 462) = 675.751, p < .001, accounting for 52.32% of the variance. The effect of genre was significant, F(10, 462) = 5.340, p < .001, accounting for 4.13% of the variance. The effect of critics_score was significant, F(1, 462) = 46.422, p < .001, accounting for 3.59% of the variance. The effect of best_pic_win was significant, F(1, 462) = 11.921, p = .001, accounting for 0.92% of the variance. The effect of cast_experience_log was significant, F(1, 462) = 12.688, p < .001, accounting for 0.98% of the variance. The effect of runtime_log was significant, F(1, 462) = 8.938, p < .01, accounting for 0.69% of the variance. The effect of director_experience_log was significant, F(1, 462) = 5.435, p < .05, accounting for 0.42% of the variance. The effect of thtr_rel_month was not significant, F(11, 462) = 1.154, p = .317, accounting for 0.98% of the variance. The effect of best_pic_nom was not significant, F(1, 462) = 2.242, p = .135, accounting for 0.17% of the variance. Finally, the residuals accounted for 35.77% of the variance.
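
The F statistics and variance percentages in the Model Alpha analysis of variance follow directly from the degrees of freedom and sums of squares. The sketch below recomputes them from the tabulated values; the function name `anova_summary` is illustrative, and small differences from the table can arise because the inputs are rounded to three decimals.

```python
# (term, degrees of freedom, sum of squares) copied from the Model Alpha table
ANOVA_ALPHA = [
    ("cast_votes_log", 1, 1399.493),
    ("genre", 10, 110.584),
    ("critics_score", 1, 96.140),
    ("best_pic_win", 1, 24.689),
    ("cast_experience_log", 1, 26.277),
    ("runtime_log", 1, 18.511),
    ("director_experience_log", 1, 11.255),
    ("thtr_rel_month", 11, 26.295),
    ("best_pic_nom", 1, 4.644),
]
RESID_DF, RESID_SS = 462, 956.811


def anova_summary(rows, resid_df, resid_ss):
    """Return {term: {"F": ..., "pct_var": ...}} from df and sums of squares."""
    total_ss = sum(ss for _, _, ss in rows) + resid_ss
    resid_ms = resid_ss / resid_df  # residual mean square
    summary = {}
    for term, df, ss in rows:
        summary[term] = {
            "F": (ss / df) / resid_ms,        # term mean square / residual mean square
            "pct_var": 100 * ss / total_ss,   # share of the total sum of squares
        }
    return summary


summary = anova_summary(ANOVA_ALPHA, RESID_DF, RESID_SS)
for term, stats in summary.items():
    print(f"{term:25s} F = {stats['F']:8.3f}   % var = {stats['pct_var']:5.2f}")
```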

Model Coefficients

Table 8 Model Alpha Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.932 2.181 4.553 0.000
cast_votes_log 0.589 0.029 20.021 0.000
genreAnimation -1.422 0.644 -2.210 0.028
genreArt House & International -2.225 0.489 -4.551 0.000
genreComedy -0.976 0.283 -3.453 0.001
genreDocumentary -2.633 0.368 -7.156 0.000
genreDrama -1.508 0.243 -6.218 0.000
genreHorror -0.583 0.418 -1.395 0.164
genreMusical & Performing Arts -1.929 0.539 -3.579 0.000
genreMystery & Suspense -1.039 0.318 -3.271 0.001
genreOther -1.564 0.465 -3.366 0.001
genreScience Fiction & Fantasy -0.231 0.559 -0.413 0.680
critics_score 0.014 0.003 5.043 0.000
best_pic_winyes 1.128 0.665 1.696 0.091
cast_experience_log -0.841 0.207 -4.067 0.000
runtime_log 0.676 0.330 2.049 0.041
director_experience_log 0.289 0.117 2.460 0.014
thtr_rel_monthFeb -0.038 0.371 -0.103 0.918
thtr_rel_monthMar 0.320 0.323 0.991 0.322
thtr_rel_monthApr -0.043 0.324 -0.132 0.895
thtr_rel_monthMay -0.033 0.328 -0.101 0.919
thtr_rel_monthJun 0.233 0.289 0.808 0.419
thtr_rel_monthJul 0.712 0.326 2.186 0.029
thtr_rel_monthAug 0.564 0.342 1.650 0.100
thtr_rel_monthSep -0.024 0.314 -0.076 0.939
thtr_rel_monthOct 0.124 0.294 0.422 0.673
thtr_rel_monthNov 0.311 0.326 0.954 0.341
thtr_rel_monthDec 0.394 0.306 1.289 0.198
best_pic_nomyes 0.644 0.430 1.497 0.135

As shown in Table 8, the predicted log of the number of IMDB votes for a film was 9.932, plus 0.589 for each unit of cast_votes_log, plus a genre factor associated with the genre of the film, plus 0.014 for each point of the critics score, plus 1.128 if the film won a best picture Oscar, minus 0.841 for each unit of the log of the number of films in which the cast appeared, plus 0.676 for each unit of log runtime, plus 0.289 for each unit of the log of the number of films directed by the film’s director, plus 0.644 if the film was nominated for a best picture Oscar. A further adjustment is added (or subtracted) based upon the month in which the film was released. The genres, in order of contribution to the estimated log number of IMDB votes, are:

Table 9 Model Alpha Genres and Popularity
Estimate Std. Error t value Pr(>|t|) coef
-0.231 0.559 -0.413 0.680 genreScience Fiction & Fantasy
-0.583 0.418 -1.395 0.164 genreHorror
-0.976 0.283 -3.453 0.001 genreComedy
-1.039 0.318 -3.271 0.001 genreMystery & Suspense
-1.422 0.644 -2.210 0.028 genreAnimation
-1.508 0.243 -6.218 0.000 genreDrama
-1.564 0.465 -3.366 0.001 genreOther
-1.929 0.539 -3.579 0.000 genreMusical & Performing Arts
-2.225 0.489 -4.551 0.000 genreArt House & International
-2.633 0.368 -7.156 0.000 genreDocumentary

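Every genre estimate in Table 9 is negative, so the reference genre, Action & Adventure (which has no row in the table), had the highest predicted popularity. It was followed by Science Fiction & Fantasy and Horror, whose estimates did not differ significantly from the reference genre, and then by Comedy.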

Table 10 Model Alpha Timing and Popularity
Estimate Std. Error t value Pr(>|t|) coef
0.712 0.326 2.186 0.029 thtr_rel_monthJul
0.564 0.342 1.650 0.100 thtr_rel_monthAug
0.394 0.306 1.289 0.198 thtr_rel_monthDec
0.320 0.323 0.991 0.322 thtr_rel_monthMar
0.311 0.326 0.954 0.341 thtr_rel_monthNov
0.233 0.289 0.808 0.419 thtr_rel_monthJun
0.124 0.294 0.422 0.673 thtr_rel_monthOct
-0.024 0.314 -0.076 0.939 thtr_rel_monthSep
-0.033 0.328 -0.101 0.919 thtr_rel_monthMay
-0.038 0.371 -0.103 0.918 thtr_rel_monthFeb
-0.043 0.324 -0.132 0.895 thtr_rel_monthApr

As indicated in Table 10, July and August had the largest estimates, followed by December, March and November; however, only the July estimate was significant at the .05 level.
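
To make the coefficient table concrete, the sketch below combines the Table 8 estimates into a point prediction. The coefficients are copied from Table 8; the function name `predict_log_votes` and the example film are hypothetical, with input values invented purely for illustration. Levels without a row in Table 8 (the reference genre, the reference month and the "no" levels) contribute zero.

```python
INTERCEPT = 9.932
NUMERIC = {
    "cast_votes_log": 0.589,
    "critics_score": 0.014,
    "cast_experience_log": -0.841,
    "runtime_log": 0.676,
    "director_experience_log": 0.289,
}
GENRE = {
    "Animation": -1.422, "Art House & International": -2.225,
    "Comedy": -0.976, "Documentary": -2.633, "Drama": -1.508,
    "Horror": -0.583, "Musical & Performing Arts": -1.929,
    "Mystery & Suspense": -1.039, "Other": -1.564,
    "Science Fiction & Fantasy": -0.231,
}
MONTH = {
    "Feb": -0.038, "Mar": 0.320, "Apr": -0.043, "May": -0.033,
    "Jun": 0.233, "Jul": 0.712, "Aug": 0.564, "Sep": -0.024,
    "Oct": 0.124, "Nov": 0.311, "Dec": 0.394,
}
BEST_PIC_WIN_YES = 1.128
BEST_PIC_NOM_YES = 0.644


def predict_log_votes(film):
    """Predicted log number of IMDB votes under the Model Alpha coefficients."""
    total = INTERCEPT
    for name, beta in NUMERIC.items():
        total += beta * film[name]
    # Reference levels are absent from the lookup tables and add 0.
    total += GENRE.get(film["genre"], 0.0)
    total += MONTH.get(film["thtr_rel_month"], 0.0)
    if film["best_pic_win"] == "yes":
        total += BEST_PIC_WIN_YES
    if film["best_pic_nom"] == "yes":
        total += BEST_PIC_NOM_YES
    return total


# Hypothetical film; every value below is invented for illustration.
film = {
    "cast_votes_log": 5.0,
    "critics_score": 80,
    "cast_experience_log": 3.0,
    "runtime_log": 4.8,
    "director_experience_log": 1.0,
    "genre": "Drama",
    "thtr_rel_month": "Jul",
    "best_pic_win": "no",
    "best_pic_nom": "no",
}
log_votes = predict_log_votes(film)
print(round(log_votes, 3))
```

Because the response is on the log scale, the prediction would need to be back-transformed with the same log base used to build imdb_num_votes_log, which this section does not state.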

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 3.

Figure 3 Model Alpha linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(29), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 4) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 4 Model Alpha homoscedasticity plot

The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .01), which rejects the null hypothesis of constant variance. As such, the test indicated some heteroscedasticity despite the visual impression, and the homoscedasticity assumption was not fully met in this case.
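
The null hypothesis of the Breusch-Pagan test is constant residual variance, so a large statistic is evidence against homoscedasticity. The sketch below illustrates the idea with a simplified variant that regresses the squared residuals on the fitted values and uses n times the R-squared of that auxiliary regression as the statistic. This is an assumption made for illustration and is not necessarily the variant used in this report; the function names and toy data are hypothetical.

```python
def r_squared_simple(x, y):
    """R-squared of a one-predictor least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # no variation in the predictor or nothing to explain
    return sxy ** 2 / (sxx * syy)


def breusch_pagan_lm(fitted, residuals):
    """LM statistic: n * R^2 from regressing squared residuals on fitted values."""
    squared = [e * e for e in residuals]
    return len(squared) * r_squared_simple(fitted, squared)


CHI2_CRIT_1DF_05 = 3.841  # 5% critical value of chi-square with 1 df

fitted = [float(i) for i in range(1, 21)]
# Toy residuals whose spread grows with the fitted value (heteroscedastic).
fanning = [0.1 * f * (1 if i % 2 == 0 else -1) for i, f in enumerate(fitted)]
# Toy residuals with constant spread (homoscedastic).
constant = [1.0 if i % 2 == 0 else -1.0 for i in range(20)]

print(breusch_pagan_lm(fitted, fanning) > CHI2_CRIT_1DF_05)   # constant variance rejected
print(breusch_pagan_lm(fitted, constant) > CHI2_CRIT_1DF_05)  # constant variance not rejected
```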

Residuals

The histogram and the normal Q-Q plot in Figure 5 illustrate the distribution of residuals.

Figure 5 Model Alpha residuals plot

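The histogram and normal Q-Q plot suggested an approximately normal distribution of residuals. The Shapiro-Wilk test (SW = 0.986, p < .001) rejected exact normality; however, the skewness (-0.463) and kurtosis (3.194) were close to the values expected under normality (0 and 3), which supported treating the residuals as approximately normal.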
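
Skewness and kurtosis of the residuals can be computed from their central moments. The sketch below assumes the moment-based definitions in which a normal distribution has skewness 0 and kurtosis 3 (that is, non-excess kurtosis), which is consistent with the values reported here; the function name and the sample data are illustrative only.

```python
def skewness_kurtosis(values):
    """Moment-based skewness and (non-excess) kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Symmetric toy sample: skewness is 0; kurtosis is below 3 (flatter than normal).
skew, kurt = skewness_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
print(skew, kurt)
```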

Multicollinearity

As shown in Figure 6 and Table 11, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum value of 2.139 (genre) did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 6: Model Alpha correlations among quantitative predictors

Table 11 Model Alpha variance inflation Factors
GVIF Df GVIF^(1/(2*Df))
cast_votes_log 1.612 1 1.270
genre 2.139 10 1.039
critics_score 1.365 1 1.168
best_pic_win 1.475 1 1.215
cast_experience_log 1.632 1 1.278
runtime_log 1.427 1 1.195
director_experience_log 1.160 1 1.077
thtr_rel_month 1.532 11 1.020
best_pic_nom 1.549 1 1.245
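
The last column of Table 11 rescales each GVIF by its degrees of freedom so that multi-level factors such as genre can be compared with single-coefficient predictors. As a minimal sketch (the function name `adjusted_gvif` is illustrative, and the inputs are copied from Table 11), the column can be reproduced as follows. Squaring the adjusted value puts it back on the usual VIF scale, which is one common way to compare it with a VIF threshold; whether the threshold of 4 was applied to the raw or the adjusted values is not stated in this report.

```python
def adjusted_gvif(gvif, df):
    """GVIF^(1/(2*Df)): a per-coefficient measure comparable across terms."""
    return gvif ** (1.0 / (2 * df))


# term: (GVIF, Df), copied from Table 11
TABLE_11 = {
    "cast_votes_log": (1.612, 1),
    "genre": (2.139, 10),
    "critics_score": (1.365, 1),
    "best_pic_win": (1.475, 1),
    "cast_experience_log": (1.632, 1),
    "runtime_log": (1.427, 1),
    "director_experience_log": (1.160, 1),
    "thtr_rel_month": (1.532, 11),
    "best_pic_nom": (1.549, 1),
}

for term, (gvif, df) in TABLE_11.items():
    adj = adjusted_gvif(gvif, df)
    print(f"{term:25s} adjusted = {adj:.3f}   squared = {adj ** 2:.3f}")
```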

Outliers

Figure 7 Model Alpha Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 22 cases exerting undue influence on the model. To discern the effect of these outliers on the model, a new model (Model Beta) was created with the outliers removed.

Model Beta

For this model, a forward selection procedure was undertaken based upon the full model with outliers from Model Alpha removed. The variables were added as described in Table 12.

Table 12: Model Beta forward selection process
Step Selected Model.Size DF F.statistic R.Squared Adjusted.R2 p.value Pct Chg
1 cast_votes_log 1 2 467 651.36 0.58 0.58 0 0.00
2 genre 2 12 457 67.79 0.62 0.61 0 4.98
3 critics_score 3 13 456 71.52 0.65 0.64 0 5.40
4 cast_experience_log 4 14 455 68.77 0.66 0.65 0 1.40
5 best_pic_win 5 15 454 66.67 0.67 0.66 0 1.53
6 runtime_log 6 16 453 64.20 0.68 0.67 0 1.06
7 director_experience_log 7 17 452 61.51 0.68 0.67 0 0.60
8 mpaa_rating 8 21 448 49.56 0.69 0.68 0 0.15
9 thtr_rel_month 9 32 437 32.50 0.70 0.68 0 0.15
10 best_actress_win 10 33 436 31.62 0.70 0.68 0 0.15

Model Overview

This model is defined as follows: \[y_i = (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \beta_7 x_{i7} + \beta_8 x_{i8} + \beta_9 x_{i9} + \beta_{10} x_{i10}) + \epsilon_i\]

where:
\(y_i\) is the log number of votes for movie \(i\)
\(x_{i1}\) is log of cast_votes for movie \(i\)
\(x_{i2}\) is genre of movie (action & adventure, comedy, documentary, drama, horror, mystery & suspense, other) for movie \(i\)
\(x_{i3}\) is critics score on Rotten Tomatoes for movie \(i\)
\(x_{i4}\) is log of the sum across all cast members for a film, of the number of films in which each actor appeared for movie \(i\)
\(x_{i5}\) is whether or not the movie won a best picture Oscar (no, yes) for movie \(i\)
\(x_{i6}\) is log runtime of movie (in minutes) for movie \(i\)
\(x_{i7}\) is log of the total number of films directed by the film’s director for movie \(i\)
\(x_{i8}\) is MPAA rating of the movie (G, PG, PG-13, R, Unrated) for movie \(i\)
\(x_{i9}\) is month the movie is released in theaters for movie \(i\)
\(x_{i10}\) is whether or not one of the main actresses in the movie ever won an Oscar (no, yes) for movie \(i\)

\(\epsilon_i\) is the residual for movie \(i\)

As suggested by Figure 8, the model was significant (F(32, 436) = 31.618, p < .001), with an adjusted R-squared of 0.677.

Figure 8 Model Beta Regression

Analysis of Variance

Figure 9 summarizes the analysis of variance.
Term Df Sum Sq Mean Sq F Statistic Pr(>F) % Var
cast_votes_log 1 1411.992 1411.992 843.213 0.000 58.24
genre 10 91.169 9.117 5.444 0.000 3.76
critics_score 1 80.036 80.036 47.796 0.000 3.30
cast_experience_log 1 23.489 23.489 14.027 0.000 0.97
best_pic_win 1 24.312 24.312 14.519 0.000 1.00
runtime_log 1 17.794 17.794 10.626 0.001 0.73
director_experience_log 1 12.500 12.500 7.464 0.007 0.52
mpaa_rating 4 8.385 2.096 1.252 0.288 0.35
thtr_rel_month 11 21.177 1.925 1.150 0.321 0.87
best_actress_win 1 3.390 3.390 2.025 0.155 0.14
Residuals 436 730.099 1.675 NA NA 30.12

Figure 9 Model Beta analysis of variance

An analysis of variance was conducted to assess the contribution of the 10 predictors to the log number of IMDB votes. The effect of cast_votes_log was significant, F(1, 436) = 843.213, p < .001, accounting for 58.24% of the variance. The effect of genre was significant, F(10, 436) = 5.444, p < .001, accounting for 3.76% of the variance. The effect of critics_score was significant, F(1, 436) = 47.796, p < .001, accounting for 3.30% of the variance. The effect of cast_experience_log was significant, F(1, 436) = 14.027, p < .001, accounting for 0.97% of the variance. The effect of best_pic_win was significant, F(1, 436) = 14.519, p < .001, accounting for 1.00% of the variance. The effect of runtime_log was significant, F(1, 436) = 10.626, p < .01, accounting for 0.73% of the variance. The effect of director_experience_log was significant, F(1, 436) = 7.464, p < .01, accounting for 0.52% of the variance. The effect of mpaa_rating was not significant, F(4, 436) = 1.252, p = .288, accounting for 0.35% of the variance. The effect of thtr_rel_month was not significant, F(11, 436) = 1.150, p = .321, accounting for 0.87% of the variance. The effect of best_actress_win was not significant, F(1, 436) = 2.025, p = .155, accounting for 0.14% of the variance. Finally, the residuals accounted for 30.12% of the variance.
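
The overall fit statistics reported for Model Beta can be cross-checked against its analysis of variance. The sketch below derives R-squared, adjusted R-squared and the overall F statistic from the tabulated degrees of freedom and sums of squares; the function name `fit_summary` is illustrative, and the results agree with the reported values only up to rounding of the inputs. Note that the term degrees of freedom in the table sum to 32.

```python
# (degrees of freedom, sum of squares) for each term in the Model Beta table
TERMS_BETA = [
    (1, 1411.992),   # cast_votes_log
    (10, 91.169),    # genre
    (1, 80.036),     # critics_score
    (1, 23.489),     # cast_experience_log
    (1, 24.312),     # best_pic_win
    (1, 17.794),     # runtime_log
    (1, 12.500),     # director_experience_log
    (4, 8.385),      # mpaa_rating
    (11, 21.177),    # thtr_rel_month
    (1, 3.390),      # best_actress_win
]
RESID_BETA = (436, 730.099)  # residual df and residual sum of squares


def fit_summary(terms, resid):
    """Overall R-squared, adjusted R-squared and F from an ANOVA table."""
    model_df = sum(df for df, _ in terms)
    model_ss = sum(ss for _, ss in terms)
    resid_df, resid_ss = resid
    total_df = model_df + resid_df  # equals n - 1
    total_ss = model_ss + resid_ss
    r2 = model_ss / total_ss
    adj_r2 = 1 - (resid_ss / resid_df) / (total_ss / total_df)
    f_stat = (model_ss / model_df) / (resid_ss / resid_df)
    return {"model_df": model_df, "r2": r2, "adj_r2": adj_r2, "F": f_stat}


fit = fit_summary(TERMS_BETA, RESID_BETA)
print(fit["model_df"], round(fit["r2"], 3), round(fit["adj_r2"], 3), round(fit["F"], 3))
```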

Model Coefficients

Table 13 Model Beta Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.911 2.043 4.361 0.000
cast_votes_log 0.609 0.028 21.379 0.000
genreAnimation -1.351 0.742 -1.820 0.069
genreArt House & International -1.543 0.522 -2.955 0.003
genreComedy -0.800 0.260 -3.077 0.002
genreDocumentary -2.145 0.376 -5.710 0.000
genreDrama -1.368 0.224 -6.097 0.000
genreHorror -0.384 0.384 -0.999 0.318
genreMusical & Performing Arts -1.807 0.539 -3.351 0.001
genreMystery & Suspense -0.875 0.297 -2.949 0.003
genreOther -1.427 0.488 -2.922 0.004
genreScience Fiction & Fantasy -0.159 0.504 -0.316 0.752
critics_score 0.014 0.003 5.329 0.000
cast_experience_log -0.788 0.193 -4.073 0.000
best_pic_winyes 1.665 0.523 3.185 0.002
runtime_log 0.825 0.308 2.679 0.008
director_experience_log 0.260 0.107 2.425 0.016
mpaa_ratingPG -0.277 0.429 -0.645 0.519
mpaa_ratingPG-13 -0.041 0.440 -0.092 0.927
mpaa_ratingR -0.273 0.420 -0.649 0.517
mpaa_ratingUnrated -0.756 0.511 -1.480 0.140
thtr_rel_monthFeb 0.065 0.351 0.184 0.854
thtr_rel_monthMar 0.424 0.302 1.403 0.161
thtr_rel_monthApr 0.048 0.299 0.161 0.872
thtr_rel_monthMay 0.183 0.308 0.595 0.552
thtr_rel_monthJun 0.199 0.267 0.747 0.455
thtr_rel_monthJul 0.599 0.297 2.017 0.044
thtr_rel_monthAug 0.699 0.315 2.217 0.027
thtr_rel_monthSep -0.049 0.291 -0.168 0.866
thtr_rel_monthOct 0.079 0.267 0.297 0.767
thtr_rel_monthNov 0.291 0.300 0.970 0.333
thtr_rel_monthDec 0.269 0.278 0.965 0.335
best_actress_winyes -0.281 0.198 -1.423 0.155

As shown in Table 13, the predicted log of the number of IMDB votes for a film was 8.911, plus 0.609 for each unit of cast_votes_log. A factor is added for the genre of the film as follows:

Table 14 Model Beta Genres and Popularity
Estimate Std. Error t value Pr(>|t|) coef
-0.159 0.504 -0.316 0.752 genreScience Fiction & Fantasy
-0.384 0.384 -0.999 0.318 genreHorror
-0.800 0.260 -3.077 0.002 genreComedy
-0.875 0.297 -2.949 0.003 genreMystery & Suspense
-1.351 0.742 -1.820 0.069 genreAnimation
-1.368 0.224 -6.097 0.000 genreDrama
-1.427 0.488 -2.922 0.004 genreOther
-1.543 0.522 -2.955 0.003 genreArt House & International
-1.807 0.539 -3.351 0.001 genreMusical & Performing Arts
-2.145 0.376 -5.710 0.000 genreDocumentary

It should be noted that most, but not all, of the genre estimates were statistically significant.

An additional 0.014 log votes are added for each point of the critics score, minus 0.788 for each unit of the log of the number of films in which the top 5 cast members appeared, plus 1.665 if the film won best picture, plus 0.825 for each unit of log runtime, plus 0.26 for each unit of the log of the number of films that the director had directed, minus 0.281 if one of the main actresses in the film had ever won an Oscar. The estimate of the log number of IMDB votes is adjusted according to the month of release as follows:

Table 15 Model Beta Timing and Popularity
Estimate Std. Error t value Pr(>|t|) coef
0.699 0.315 2.217 0.027 thtr_rel_monthAug
0.599 0.297 2.017 0.044 thtr_rel_monthJul
0.424 0.302 1.403 0.161 thtr_rel_monthMar
0.291 0.300 0.970 0.333 thtr_rel_monthNov
0.269 0.278 0.965 0.335 thtr_rel_monthDec
0.199 0.267 0.747 0.455 thtr_rel_monthJun
0.183 0.308 0.595 0.552 thtr_rel_monthMay
0.079 0.267 0.297 0.767 thtr_rel_monthOct
0.065 0.351 0.184 0.854 thtr_rel_monthFeb
0.048 0.299 0.161 0.872 thtr_rel_monthApr
-0.049 0.291 -0.168 0.866 thtr_rel_monthSep

It should be noted, however, that most of the estimates for the months were not significant; only the July and August estimates were significant at the .05 level.

Lastly, an adjustment to the estimate is made for the MPAA rating.

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 10.

Figure 10 Model Beta linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(33), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 11) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 11 Model Beta homoscedasticity plot

The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .05), which rejects the null hypothesis of constant variance. As such, the test indicated some heteroscedasticity despite the visual impression, and the homoscedasticity assumption was not fully met in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 12 illustrate the distribution of residuals.

Figure 12 Model Beta residuals plot

The histogram and normal Q-Q plot suggested an approximately normal distribution of residuals. The Shapiro-Wilk test (SW = 0.991, p = 0.007) rejected exact normality; however, the skewness (-0.297) and kurtosis (2.842) were close to the values expected under normality (0 and 3), which supported treating the residuals as approximately normal.

Multicollinearity

As shown in Figure 13 and Table 16, collinearity appeared to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum value of 4.060 (genre) exceeded the threshold of 4. As such, the correlation among the predictors would require further consideration. The assumption of no multicollinearity was not met for this model.
Figure 13: Correlations among quantitative predictors

Table 16 Model Beta Variance Inflation Factors
GVIF Df GVIF^(1/(2*Df))
cast_votes_log 1.752 1 1.324
genre 4.060 10 1.073
critics_score 1.435 1 1.198
cast_experience_log 1.707 1 1.306
best_pic_win 1.126 1 1.061
runtime_log 1.488 1 1.220
director_experience_log 1.160 1 1.077
mpaa_rating 2.835 4 1.139
thtr_rel_month 1.658 11 1.023
best_actress_win 1.205 1 1.098

Outliers

Figure 14 Model Beta Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 19 cases exerting undue influence on the model.

Model Gamma

For this model, a backward elimination procedure was undertaken based upon the full model. The variables were removed as described in Table 17.

Table 17: Model Gamma
Steps Removed p.value
1 best_actress_win 0.53

Model Overview

This model is defined as follows: \[y_i = (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \beta_7 x_{i7} + \beta_8 x_{i8} + \beta_9 x_{i9} + \beta_{10} x_{i10} + \beta_{11} x_{i11} + \beta_{12} x_{i12} + \beta_{13} x_{i13}) + \epsilon_i\]

where:
\(y_i\) is the log number of votes for movie \(i\)
\(x_{i1}\) is genre of movie (action & adventure, comedy, documentary, drama, horror, mystery & suspense, other) for movie \(i\)
\(x_{i2}\) is MPAA rating of the movie (G, PG, PG-13, R, Unrated) for movie \(i\)
\(x_{i3}\) is month the movie is released in theaters for movie \(i\)
\(x_{i4}\) is whether or not the movie was nominated for a best picture Oscar (no, yes) for movie \(i\)
\(x_{i5}\) is whether or not the movie won a best picture Oscar (no, yes) for movie \(i\)
\(x_{i6}\) is whether or not one of the main actors in the movie ever won an Oscar (no, yes) for movie \(i\); note that this is not necessarily whether the actor won an Oscar for their role in the given movie
\(x_{i7}\) is whether or not the director of the movie ever won an Oscar (no, yes) for movie \(i\); note that this is not necessarily whether the director won an Oscar for the given movie
\(x_{i8}\) is critics score on Rotten Tomatoes for movie \(i\)
\(x_{i9}\) is number of days from theatre release date to January 1, 2016 for movie \(i\)
\(x_{i10}\) is log runtime of movie (in minutes) for movie \(i\)
\(x_{i11}\) is log of cast_votes for movie \(i\)
\(x_{i12}\) is log of the total number of films directed by the film’s director for movie \(i\)
\(x_{i13}\) is log of the sum across all cast members for a film, of the number of films in which each actor appeared for movie \(i\)

\(\epsilon_i\) is the residual for movie \(i\)

As suggested by Figure 15, the model was significant (F(35, 455) = 23.956, p < .001), with an adjusted R-squared of 0.621.

Figure 15 Model Gamma Regression

Analysis of Variance

Figure 16 summarizes the analysis of variance.
Term Df Sum Sq Mean Sq F Statistic Pr(>F) % Var
genre 10 376.261 37.626 18.196 0.000 14.07
mpaa_rating 4 99.942 24.986 12.083 0.000 3.74
thtr_rel_month 11 103.141 9.376 4.534 0.000 3.86
best_pic_nom 1 111.151 111.151 53.752 0.000 4.16
best_pic_win 1 26.967 26.967 13.041 0.000 1.01
best_actor_win 1 17.770 17.770 8.594 0.004 0.66
best_dir_win 1 25.382 25.382 12.275 0.001 0.95
critics_score 1 145.037 145.037 70.139 0.000 5.42
thtr_days 1 207.620 207.620 100.404 0.000 7.76
runtime_log 1 60.489 60.489 29.252 0.000 2.26
cast_votes_log 1 519.590 519.590 251.271 0.000 19.43
director_experience_log 1 9.104 9.104 4.403 0.036 0.34
cast_experience_log 1 31.373 31.373 15.172 0.000 1.17
Residuals 455 940.870 2.068 NA NA 35.18

Figure 16 Model Gamma analysis of variance

An analysis of variance was conducted to assess the contribution of the 13 predictors to the log number of IMDB votes. The effect of genre was significant, F(10, 455) = 18.196, p < .001, accounting for 14.07% of the variance. The effect of mpaa_rating was significant, F(4, 455) = 12.083, p < .001, accounting for 3.74% of the variance. The effect of thtr_rel_month was significant, F(11, 455) = 4.534, p < .001, accounting for 3.86% of the variance. The effect of best_pic_nom was significant, F(1, 455) = 53.752, p < .001, accounting for 4.16% of the variance. The effect of best_pic_win was significant, F(1, 455) = 13.041, p < .001, accounting for 1.01% of the variance. The effect of best_actor_win was significant, F(1, 455) = 8.594, p < .01, accounting for 0.66% of the variance. The effect of best_dir_win was significant, F(1, 455) = 12.275, p = .001, accounting for 0.95% of the variance. The effect of critics_score was significant, F(1, 455) = 70.139, p < .001, accounting for 5.42% of the variance. The effect of thtr_days was significant, F(1, 455) = 100.404, p < .001, accounting for 7.76% of the variance. The effect of runtime_log was significant, F(1, 455) = 29.252, p < .001, accounting for 2.26% of the variance. The effect of cast_votes_log was significant, F(1, 455) = 251.271, p < .001, accounting for 19.43% of the variance. The effect of director_experience_log was significant, F(1, 455) = 4.403, p < .05, accounting for 0.34% of the variance. The effect of cast_experience_log was significant, F(1, 455) = 15.172, p < .001, accounting for 1.17% of the variance. Finally, the residuals accounted for 35.18% of the variance.

Model Coefficients

Table 18 Model Gamma Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.042 2.284 4.398 0.000
genreAnimation -1.432 0.704 -2.036 0.042
genreArt House & International -2.088 0.501 -4.171 0.000
genreComedy -1.011 0.286 -3.535 0.000
genreDocumentary -2.484 0.411 -6.042 0.000
genreDrama -1.492 0.249 -6.006 0.000
genreHorror -0.466 0.426 -1.092 0.276
genreMusical & Performing Arts -1.954 0.545 -3.588 0.000
genreMystery & Suspense -0.989 0.322 -3.072 0.002
genreOther -1.474 0.471 -3.132 0.002
genreScience Fiction & Fantasy -0.278 0.560 -0.496 0.620
mpaa_ratingPG -0.129 0.461 -0.280 0.780
mpaa_ratingPG-13 0.135 0.483 0.279 0.780
mpaa_ratingR -0.154 0.459 -0.335 0.738
mpaa_ratingUnrated -0.601 0.566 -1.061 0.289
thtr_rel_monthFeb -0.015 0.371 -0.041 0.967
thtr_rel_monthMar 0.387 0.325 1.190 0.235
thtr_rel_monthApr -0.020 0.329 -0.059 0.953
thtr_rel_monthMay 0.019 0.331 0.058 0.953
thtr_rel_monthJun 0.288 0.293 0.983 0.326
thtr_rel_monthJul 0.725 0.329 2.204 0.028
thtr_rel_monthAug 0.534 0.346 1.544 0.123
thtr_rel_monthSep 0.075 0.320 0.235 0.814
thtr_rel_monthOct 0.157 0.299 0.525 0.600
thtr_rel_monthNov 0.388 0.331 1.172 0.242
thtr_rel_monthDec 0.434 0.307 1.412 0.159
best_pic_nomyes 0.701 0.436 1.606 0.109
best_pic_winyes 0.843 0.700 1.205 0.229
best_actor_winyes -0.194 0.209 -0.931 0.352
best_dir_winyes 0.429 0.289 1.485 0.138
critics_score 0.015 0.003 5.205 0.000
thtr_days 0.000 0.000 -0.600 0.549
runtime_log 0.680 0.348 1.956 0.051
cast_votes_log 0.569 0.037 15.535 0.000
director_experience_log 0.249 0.125 1.995 0.047
cast_experience_log -0.834 0.214 -3.895 0.000

The coefficients for this model were the same as those for model Beta; however, the order in which the variables were added was different, as were the estimates.

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 17.

Figure 17 Model Gamma linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(36), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 18) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 18 Model Gamma homoscedasticity plot

The residuals plot above indicated equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .01), which formally indicates some departure from constant variance. Given the even dispersion in the residuals plot, the homoscedasticity assumption was nevertheless treated as met in this case.
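To make the logic of the test concrete, the following is a minimal sketch of a studentized (Koenker) Breusch–Pagan statistic for a single predictor, written with the Python standard library only. It is not the procedure used to produce the results above: the function names and the synthetic data are assumptions for illustration, and the statistic here is the chi-square form (n times the R-squared of the auxiliary regression) rather than the F form reported in the text. A small p-value is evidence against constant variance.

```python
import math

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test for one predictor.

    Returns (lm, p_value); lm is referred to a chi-square with 1 df.
    """
    n = len(x)
    a, b = ols_simple(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: squared residuals on the predictor.
    a2, b2 = ols_simple(x, sq_resid)
    mean_sq = sum(sq_resid) / n
    ss_total = sum((s - mean_sq) ** 2 for s in sq_resid)
    ss_resid = sum((s - (a2 + b2 * xi)) ** 2 for xi, s in zip(x, sq_resid))
    r2 = max(0.0, 1 - ss_resid / ss_total)  # guard against rounding below 0
    lm = n * r2
    p_value = math.erfc(math.sqrt(lm / 2))  # upper tail of chi-square(1)
    return lm, p_value

# Synthetic data (hypothetical): constant versus growing error spread.
x = list(range(1, 21))
sign = [1 if xi % 2 == 0 else -1 for xi in x]
y_constant = [2 * xi + s for xi, s in zip(x, sign)]
y_growing = [2 * xi + 0.5 * xi * s for xi, s in zip(x, sign)]

lm_constant, p_constant = breusch_pagan(x, y_constant)
lm_growing, p_growing = breusch_pagan(x, y_growing)
```

In this sketch the series with growing spread yields a small p-value, while the series with constant spread does not.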

Residuals

The histogram and the normal Q-Q plot in Figure 19 illustrate the distribution of residuals.

Figure 19 Model Gamma residuals plot

The histogram and normal Q-Q plot suggested a normal distribution of residuals. Although the Shapiro-Wilk test was significant (SW = 0.99, p = 0.001), the skewness (-0.386) and kurtosis (3.139) were close to the values expected under normality (0 and 3), so the assumption of normality was considered reasonable.
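For reference, skewness and kurtosis of the kind quoted above can be computed from central moments. The sketch below assumes the moment-based definitions in which a normal distribution has skewness 0 and kurtosis 3; the function names are hypothetical and the example values are illustrative, not the model residuals.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def skewness(values):
    """Moment-based skewness: m3 / m2^(3/2). Zero for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Moment-based (non-excess) kurtosis: m4 / m2^2. Equals 3 for a normal."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Illustrative values only (not the residuals from the model).
example = [1.0, 2.0, 3.0, 4.0, 5.0]
example_skew = skewness(example)
example_kurt = kurtosis(example)
```

Some packages report excess kurtosis (kurtosis minus 3) instead, so it is worth checking which convention a given function uses before comparing against 3.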

Multicollinearity

As shown in Figure 20 and Table 19, collinearity appeared extant for this model. Variance inflation factors were computed for each predictor in the model. The maximum GVIF of 4.405 (genre) exceeded the threshold of 4. As such, the correlation among the predictors would require further consideration. The multicollinearity assumption was not met for this model.
Figure 20: Correlations among quantitative predictors

Table 19 Model Gamma Variance Inflation Factors
GVIF Df GVIF^(1/(2*Df))
genre 4.405 10 1.077
mpaa_rating 3.173 4 1.155
thtr_rel_month 1.895 11 1.029
best_pic_nom 1.595 1 1.263
best_pic_win 1.636 1 1.279
best_actor_win 1.232 1 1.110
best_dir_win 1.381 1 1.175
critics_score 1.499 1 1.224
thtr_days 1.966 1 1.402
runtime_log 1.589 1 1.260
cast_votes_log 2.498 1 1.580
director_experience_log 1.313 1 1.146
cast_experience_log 1.755 1 1.325
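The third column of Table 19 follows from the first two: it is the generalized variance inflation factor raised to the power 1/(2*Df), which puts multi-degree-of-freedom terms such as genre on a scale comparable to single-coefficient predictors. A minimal sketch, using three rows copied from Table 19 (the function name is hypothetical):

```python
def adjusted_gvif(gvif, df):
    """Return GVIF^(1/(2*Df)), the dimension-adjusted inflation factor."""
    return gvif ** (1 / (2 * df))

# (GVIF, Df) pairs copied from Table 19.
rows = {
    "genre": (4.405, 10),
    "mpaa_rating": (3.173, 4),
    "thtr_days": (1.966, 1),
}
adjusted = {term: round(adjusted_gvif(g, d), 3) for term, (g, d) in rows.items()}
```

For a predictor with one degree of freedom the adjusted value is simply the square root of the GVIF.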

Outliers

Figure 21 Model Gamma Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 20 cases exerting undue influence on the model. To discern the effect of the influential points on the model, a new model (Model Delta) was created without the influential points of this model.
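As an illustration of the case-wise diagnostic mentioned above, the sketch below computes Cook's distance for a simple one-predictor regression using only the Python standard library. The data are synthetic, the function name is hypothetical, and the 4/n cutoff is a common rule of thumb assumed here for illustration; the cutoff actually used in the analysis is not stated.

```python
def cooks_distance(x, y):
    """Cook's distance for each case in a simple linear regression.

    D_i = (e_i^2 / (p * MSE)) * h_i / (1 - h_i)^2, with p = 2 coefficients.
    """
    n = len(x)
    p = 2  # intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(r * r for r in resid) / (n - p)
    # Leverage of each case in a one-predictor model.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [(r * r / (p * mse)) * (h / (1 - h) ** 2)
            for r, h in zip(resid, leverage)]

# Synthetic data (hypothetical): the last case is shifted upward by 15.
x = [float(v) for v in range(1, 11)]
y = [2 * v for v in x]
y[-1] += 15.0

distances = cooks_distance(x, y)
cutoff = 4 / len(x)  # assumed rule-of-thumb threshold
flagged = [i for i, d in enumerate(distances) if d > cutoff]
```

Here only the shifted case exceeds the cutoff, which mirrors how influential cases would be identified and set aside before refitting.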

Model Delta

For this model, a backward elimination procedure was undertaken based upon the full model. The variables were removed as described in Table 20.

Table 20: Model Delta
Steps Removed p.value
1 best_actress_win 0.52

Model Overview

This model is defined as follows: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \beta_7 x_{i7} + \beta_8 x_{i8} + \beta_9 x_{i9} + \beta_{10} x_{i10} + \beta_{11} x_{i11} + \beta_{12} x_{i12} + \epsilon_i\]

where:
\(\epsilon_i\) is the total residual for the model for movie \(i\)

As suggested by Figure 22, the model was significant (F(35, 435) = 28.66, p < .001), with an adjusted R-squared of 0.673.

Figure 22 Model Delta Regression

Analysis of Variance

Figure 23 summarizes the analysis of variance.
Term Df Sum Sq Mean Sq F Statistic Pr(>F) % Var
genre 10 368.668 36.867 21.669 0.000 15.07
mpaa_rating 4 113.484 28.371 16.676 0.000 4.64
thtr_rel_month 11 89.408 8.128 4.777 0.000 3.65
best_pic_nom 1 99.674 99.674 58.586 0.000 4.07
best_pic_win 1 31.941 31.941 18.774 0.000 1.31
best_actor_win 1 21.859 21.859 12.848 0.000 0.89
best_dir_win 1 23.668 23.668 13.911 0.000 0.97
critics_score 1 122.917 122.917 72.248 0.000 5.02
thtr_days 1 200.765 200.765 118.005 0.000 8.21
runtime_log 1 72.442 72.442 42.580 0.000 2.96
cast_votes_log 1 516.605 516.605 303.649 0.000 21.11
director_experience_log 1 11.883 11.883 6.985 0.009 0.49
cast_experience_log 1 33.313 33.313 19.580 0.000 1.36
Residuals 435 740.075 1.701 NA NA 30.25

Figure 23 Model Delta analysis of variance

An analysis of variance was conducted on the influence of 13 independent variables on the log number of IMDB votes. Each effect is reported with its F statistic, p-value, and the percentage of variance it accounted for: genre, F(10, 435) = 21.669, p < .001, 15.07%; mpaa_rating, F(4, 435) = 16.676, p < .001, 4.64%; thtr_rel_month, F(11, 435) = 4.777, p < .001, 3.65%; best_pic_nom, F(1, 435) = 58.586, p < .001, 4.07%; best_pic_win, F(1, 435) = 18.774, p < .001, 1.31%; best_actor_win, F(1, 435) = 12.848, p < .001, 0.89%; best_dir_win, F(1, 435) = 13.911, p < .001, 0.97%; critics_score, F(1, 435) = 72.248, p < .001, 5.02%; thtr_days, F(1, 435) = 118.005, p < .001, 8.21%; runtime_log, F(1, 435) = 42.58, p < .001, 2.96%; cast_votes_log, F(1, 435) = 303.649, p < .001, 21.11%; director_experience_log, F(1, 435) = 6.985, p < .01, 0.49%; and cast_experience_log, F(1, 435) = 19.58, p < .001, 1.36%. Finally, the residuals accounted for 30.25% of the variance.
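The percentage of variance attributed to each term is its sum of squares divided by the total sum of squares. The sketch below reproduces the % Var column from the Sum Sq column of the Model Delta analysis of variance table; the variable names are illustrative.

```python
# Sum Sq values copied from the Model Delta analysis of variance table.
sum_sq = {
    "genre": 368.668,
    "mpaa_rating": 113.484,
    "thtr_rel_month": 89.408,
    "best_pic_nom": 99.674,
    "best_pic_win": 31.941,
    "best_actor_win": 21.859,
    "best_dir_win": 23.668,
    "critics_score": 122.917,
    "thtr_days": 200.765,
    "runtime_log": 72.442,
    "cast_votes_log": 516.605,
    "director_experience_log": 11.883,
    "cast_experience_log": 33.313,
    "Residuals": 740.075,
}

total = sum(sum_sq.values())
# Share of the total sum of squares attributed to each term, in percent.
pct_var = {term: 100 * ss / total for term, ss in sum_sq.items()}
# R-squared is the share not left in the residuals.
r_squared = 1 - sum_sq["Residuals"] / total
```

The residual share and the R-squared are complements, which is why 30.25% residual variance corresponds to the 69.752% of variance explained that is reported for Model Delta.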

Model Coefficients

Table 21 Model Delta Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.669 2.120 4.562 0.000
genreAnimation -1.144 0.763 -1.499 0.135
genreArt House & International -1.735 0.475 -3.654 0.000
genreComedy -0.914 0.263 -3.479 0.001
genreDocumentary -2.303 0.388 -5.941 0.000
genreDrama -1.452 0.227 -6.399 0.000
genreHorror -0.433 0.388 -1.116 0.265
genreMusical & Performing Arts -1.864 0.546 -3.415 0.001
genreMystery & Suspense -0.883 0.295 -2.989 0.003
genreOther -1.749 0.472 -3.705 0.000
genreScience Fiction & Fantasy -0.224 0.509 -0.440 0.660
mpaa_ratingPG -0.042 0.454 -0.092 0.926
mpaa_ratingPG-13 0.231 0.472 0.490 0.624
mpaa_ratingR -0.033 0.451 -0.073 0.942
mpaa_ratingUnrated -0.545 0.548 -0.995 0.320
thtr_rel_monthFeb -0.033 0.352 -0.095 0.925
thtr_rel_monthMar 0.458 0.302 1.515 0.130
thtr_rel_monthApr -0.093 0.302 -0.309 0.758
thtr_rel_monthMay 0.217 0.312 0.695 0.487
thtr_rel_monthJun 0.096 0.270 0.356 0.722
thtr_rel_monthJul 0.586 0.301 1.951 0.052
thtr_rel_monthAug 0.643 0.322 1.997 0.046
thtr_rel_monthSep -0.045 0.297 -0.152 0.879
thtr_rel_monthOct 0.037 0.274 0.137 0.891
thtr_rel_monthNov 0.166 0.304 0.547 0.585
thtr_rel_monthDec 0.233 0.282 0.827 0.409
best_pic_nomyes 0.500 0.412 1.216 0.225
best_pic_winyes 1.055 0.644 1.639 0.102
best_actor_winyes -0.103 0.195 -0.530 0.596
best_dir_winyes 0.350 0.263 1.328 0.185
critics_score 0.013 0.003 4.919 0.000
thtr_days 0.000 0.000 -0.241 0.809
runtime_log 0.741 0.322 2.298 0.022
cast_votes_log 0.593 0.035 17.057 0.000
director_experience_log 0.289 0.115 2.515 0.012
cast_experience_log -0.882 0.199 -4.425 0.000

The coefficients for this model were the same as those of model Gamma. The order was different, as were the estimates.

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 24.

Figure 24 Model Delta linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(36), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 25) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 25 Model Delta homoscedasticity plot

The residuals plot above indicated equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .05), which formally indicates some departure from constant variance. Given the even dispersion in the residuals plot, the homoscedasticity assumption was nevertheless treated as met in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 26 illustrate the distribution of residuals.

Figure 26 Model Delta residuals plot

The histogram and normal Q-Q plot suggested a normal distribution of residuals. Although the Shapiro-Wilk test was significant (SW = 0.992, p = 0.011), the skewness (-0.281) and kurtosis (2.851) were close to the values expected under normality (0 and 3), so the assumption of normality was considered reasonable.

Multicollinearity

As shown in Figure 27 and Table 22, collinearity appeared extant for this model. Variance inflation factors were computed for each predictor in the model. The maximum GVIF of 4.481 (genre) exceeded the threshold of 4. As such, the correlation among the predictors would require further consideration. The multicollinearity assumption was not met for this model.
Figure 27: Correlations among quantitative predictors

Table 22 Model Delta Variance Inflation Factors
GVIF Df GVIF^(1/(2*Df))
genre 4.481 10 1.078
mpaa_rating 3.226 4 1.158
thtr_rel_month 1.886 11 1.029
best_pic_nom 1.631 1 1.277
best_pic_win 1.679 1 1.296
best_actor_win 1.234 1 1.111
best_dir_win 1.388 1 1.178
critics_score 1.502 1 1.226
thtr_days 2.055 1 1.433
runtime_log 1.602 1 1.266
cast_votes_log 2.599 1 1.612
director_experience_log 1.318 1 1.148
cast_experience_log 1.759 1 1.326

Outliers

Figure 28 Model Delta Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 19 cases exerting undue influence on the model.

Model Comparisons

To summarize, models Alpha and Beta were constructed using forward selection and models Gamma and Delta were developed via backward elimination. Models Beta and Delta were fitted without the influential data points from models Alpha and Gamma respectively.

Table 23 Summary of models
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Alpha 9 29 462 29.625 1.439 1.439 0.642 0.621 0 64.227
Model Beta 10 33 436 31.618 1.294 1.294 0.699 0.677 0 69.885
Model Gamma 13 36 455 23.956 1.438 1.438 0.648 0.621 0 64.823
Model Delta 13 36 435 28.660 1.304 1.304 0.698 0.673 0 69.752
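The adjusted R-squared and F statistic in Table 23 can be recovered from R-squared and the degrees of freedom. The sketch below checks the Model Delta row. It assumes that the 36 in the df column counts the intercept along with 35 slope terms (the Df column of the Model Delta analysis of variance sums to 35), so the number of observations is taken to be 435 + 35 + 1 = 471. The function names are hypothetical.

```python
def adjusted_r_squared(r2, n_obs, n_terms):
    """Adjusted R-squared for a model with n_terms slope coefficients."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_terms - 1)

def overall_f(r2, n_obs, n_terms):
    """Overall F statistic on (n_terms, n_obs - n_terms - 1) degrees of freedom."""
    df_resid = n_obs - n_terms - 1
    return (r2 / n_terms) / ((1 - r2) / df_resid)

# Model Delta: R-squared from the % Variance column (69.752%).
r2_delta = 0.69752
n_terms_delta = 35          # assumed: slope terms, excluding the intercept
n_obs_delta = 435 + 35 + 1  # residual df + slope terms + intercept

adj_r2_delta = adjusted_r_squared(r2_delta, n_obs_delta, n_terms_delta)
f_delta = overall_f(r2_delta, n_obs_delta, n_terms_delta)
```

Under these assumptions the results round to 0.673 and 28.66, matching the Model Delta row.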

Forward Selection vs. Backward Elimination

As shown in Table 23, the forward selection algorithm produced fewer predictors than the backward elimination algorithm. Notwithstanding, the differences in root mean square error between the models were not significant (-0.08% and -0.79%). Similarly, the differences in adjusted R-squared (-0.09% and -0.53%) were not significant. Lastly, the differences in the percent of variance explained by the models were also not significant (0.93% and -0.19%).

Drop or Not

The Beta and Delta models were trained on data sans the influential points from Alpha and Gamma. The differences in RMSE (11.21% and 10.25%) were somewhat significant, as were the differences in adjusted R-squared (9.05% and 8.37%), and the percent of variance explained (8.81% and 7.6%).

Prediction Accuracy

To evaluate the effects of model selection method and the treatment of outliers on prediction accuracy, the four multiple regression models were evaluated on the test data. Four measures of prediction accuracy were used:

  1. MAPE - Mean Absolute Percentage Error
  2. MPE - Mean Percentage Error
  3. MSE - Mean Squared Error
  4. RMSE - Root Mean Squared Error

In addition, a percent accuracy measure was computed as the percentage of the observations in the test set in which the actual log number of IMDB votes fell within the prediction interval.
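The measures listed above can be sketched as follows. The sign convention for MPE (actual minus predicted, divided by actual) is an assumption, chosen so that a negative value corresponds to over-prediction; the function names and the example numbers are hypothetical, not taken from the test data.

```python
import math

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def mpe(actual, predicted):
    """Mean Percentage Error, in percent; negative when predictions run high."""
    return 100 * sum((a - p) / a for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(mse(actual, predicted))

def interval_coverage(actual, lower, upper):
    """Percent of actual values that fall inside their prediction interval."""
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return 100 * hits / len(actual)

# Hypothetical values for illustration only.
actual = [10.0, 20.0]
predicted = [11.0, 18.0]
example_mape = mape(actual, predicted)
example_mpe = mpe(actual, predicted)
example_mse = mse(actual, predicted)
example_rmse = rmse(actual, predicted)
example_coverage = interval_coverage(actual, [9.0, 21.0], [12.0, 25.0])
```

Because MPE keeps the sign of each error, over- and under-predictions can cancel, which is why it is read alongside MAPE rather than in place of it.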

Table 24 Model Predictive Accuracy Summary
Model Size F Statistic R-Squared Adj R-Squared % Variance MAPE MPE MSE RMSE % Accuracy
Model Alpha 9 29.625 0.642 0.621 64.227 8.415 -1.165 2.069 1.438 94.355
Model Beta 10 31.618 0.699 0.677 69.885 8.372 -1.578 2.156 1.468 93.548
Model Gamma 13 23.956 0.648 0.621 64.823 8.373 -0.894 2.072 1.439 96.774
Model Delta 13 28.660 0.698 0.673 69.752 8.444 -1.374 2.124 1.457 93.548

There were no significant differences in MAPE, MSE, and RMSE between the models. The negative MPE values indicate that all models were biased toward over-prediction. From a percent accuracy perspective, it is worth noting that the forward selection and backward elimination models fitted without the influential points (Beta and Delta) performed identically (93.548%), and that the Alpha and Gamma models performed comparably (94.355% and 96.774%); however, the Alpha model was able to do so with 4 fewer variables. Therefore, the most parsimonious model, Alpha, would advance to the movie prediction stage.

The Model

The prediction equation was defined as follows: \[y_i = 9.932304 + 0.589x_{1} - 1.422x_{2} - 2.225x_{3} - 0.976x_{4} - 2.633x_{5} - 1.508x_{6} - 0.583x_{7} - 1.929x_{8} - 1.039x_{9} - 1.564x_{10} - 0.231x_{11} + 0.014x_{12} + 1.128x_{13} - 0.841x_{14} + 0.676x_{15} + 0.289x_{16} - 0.038x_{17}\]
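As a rough illustration of how the equation would be applied, the sketch below evaluates it for a vector of 17 predictor values. The intercept and slopes are copied from the equation; the function names and the input vector are hypothetical, and the mapping of each \(x_j\) to a specific variable is not restated here. Converting the prediction back to a vote count with the exponential assumes the response is a natural logarithm, which the text does not specify.

```python
import math

INTERCEPT = 9.932304
SLOPES = [
    0.589, -1.422, -2.225, -0.976, -2.633, -1.508, -0.583, -1.929, -1.039,
    -1.564, -0.231, 0.014, 1.128, -0.841, 0.676, 0.289, -0.038,
]

def predict_log_votes(x):
    """Evaluate the prediction equation for 17 predictor values x_1..x_17."""
    if len(x) != len(SLOPES):
        raise ValueError("expected 17 predictor values")
    return INTERCEPT + sum(b * xj for b, xj in zip(SLOPES, x))

def predict_votes(x):
    """Back-transform to votes, assuming a natural-log response (assumption)."""
    return math.exp(predict_log_votes(x))

# Hypothetical input: only x_1 is non-zero.
x_example = [1.0] + [0.0] * 16
log_votes_example = predict_log_votes(x_example)
```

With every predictor at zero the function returns the intercept, which is a quick check that the coefficients were entered in the right order.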

References

John James jjames@datasciencesalon.org

19 November, 2017